9 research outputs found

    SourcererCC: Scaling Code Clone Detection to Big Code

    Despite a decade of active research, there is a marked lack of clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones where significant editing activities may take place in the cloned code. We present SourcererCC, a token-based clone detector that targets three clone types and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. SourcererCC uses an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, and the number of token comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall, and precision of SourcererCC, and compare it to four publicly available, state-of-the-art tools. To measure recall, we use two recent benchmarks: (1) a large benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (250 MLOC) using a standard workstation.
    Comment: Accepted for publication at ICSE'16 (preprint, unrevised)
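The index lookup described in this abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization and a plain in-memory inverted index; the function names (`tokenize`, `build_index`, `candidates`) are hypothetical, and the token-ordering filtering heuristics that the abstract mentions are not reproduced here.

```python
from collections import Counter, defaultdict


def tokenize(code):
    # Assumption: whitespace splitting stands in for a real lexer.
    return code.split()


def build_index(blocks):
    """Map each distinct token to the ids of the code blocks containing it."""
    index = defaultdict(set)
    for block_id, code in blocks.items():
        for token in set(tokenize(code)):
            index[token].add(block_id)
    return index


def candidates(index, query_code):
    """Count, per indexed block, how many distinct query tokens it shares."""
    shared = Counter()
    for token in set(tokenize(query_code)):
        for block_id in index.get(token, ()):
            shared[block_id] += 1
    return shared


blocks = {
    "a": "int x = 0 ;",
    "b": "int y = 0 ;",
    "c": "return foo ( ) ;",
}
index = build_index(blocks)
hits = candidates(index, "int z = 0 ;")
```

Only blocks that share at least one token with the query appear in `hits`, so a detector built this way never has to compare the query against every block in the corpus.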

    Towards Automating Precision Studies of Clone Detectors

    Current research in clone detection suffers from poor ecosystems for evaluating the precision of clone detection tools. Corpora of labeled clones are scarce and incomplete, making evaluation labor-intensive and idiosyncratic, and limiting inter-tool comparison. Precision-assessment tools are simply lacking. We present a semi-automated approach to facilitate precision studies of clone detection tools. The approach merges automatic mechanisms of clone classification with manual validation of clone pairs. We demonstrate that the proposed automatic approach has very high precision and significantly reduces the number of clone pairs that need human validation during precision experiments. Moreover, we aggregate the individual effort of multiple teams into a single evolving dataset of labeled clone pairs, creating an important asset for software clone research.
    Comment: Accepted to be published in the 41st ACM/IEEE International Conference on Software Engineering

    Large-Scale Code Clone Detection

    Clone detection locates exact or similar pieces of code, known as clones, within or between software systems. With the amount of source code increasing steadily, large-scale clone detection has become a necessity. Large code bases and repositories of projects have led to several new use cases of clone detection, including mining library candidates, detecting similar mobile applications, detecting license violations, reverse engineering product lines, finding the provenance of a component, and code search. While several techniques have been proposed for clone detection over many years, the accuracy and scalability of clone detection tools and techniques remain an active area of research. Specifically, there is a marked lack of clone detectors that scale to large systems or repositories, particularly for detecting near-miss clones where significant editing activities may have taken place in the cloned code.

    This problem motivates the need for clone detection techniques and tools that satisfy the following requirements: (1) accurate detection of near-miss clones, where minor to significant editing changes occur in the copy/pasted fragments; (2) scalability to hundreds of millions of lines of code and several thousand projects; and (3) minimal dependency on programming languages. To that effect, this dissertation presents SourcererCC, an accurate, near-miss clone detection tool that scales to hundreds of millions of lines of code (MLOC) on a single standard machine. The core idea of SourcererCC is to build an optimized index of code blocks and compare them using a simple bag-of-tokens strategy, which is very effective in detecting near-miss clones. Coupled with several filtering heuristics that reduce the size of the index, this approach is also very efficient, as it reduces the number of code block comparisons needed to detect the clones.

    This dissertation evaluates the scalability, execution time, and accuracy of SourcererCC against four state-of-the-art open-source tools: CCFinderX, Deckard, iClones, and NiCad. To measure scalability, the performance of the tools is evaluated on the inter-project software repository IJaDataset-2.0, consisting of 25,000 projects containing 3 million files and 250 MLOC. To measure precision and recall, two recent benchmarks are used: (1) a benchmark of real clones, BigCloneBench, that spans the four primary clone types and the full spectrum of syntactical similarity in three different languages (Java, C, and C#); and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. The results of these experiments suggest that SourcererCC improves the state of the art in code clone detection by being the most scalable technique known so far, with accuracy on par with the current state-of-the-art tools.

    Additionally, this dissertation presents two tools built on top of SourcererCC: (i) SourcererCC-D, a distributed version of SourcererCC that exploits the inherent parallelism in SourcererCC's approach to scale horizontally on a cluster of commodity machines for large-scale code clone detection (our experiments demonstrate SourcererCC-D's ability to achieve ideal speed-up and near-linear scale-up on large datasets); and (ii) SourcererCC-I, an interactive and real-time version of SourcererCC that is integrated with the Eclipse development environment. SourcererCC-I is built to support developers in clone-aware development and maintenance activities. Finally, this dissertation concludes by presenting two empirical studies conducted using SourcererCC to demonstrate its effectiveness in practice.
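As an illustration of the bag-of-tokens comparison described in this abstract, here is a minimal sketch. It assumes whitespace tokenization and treats two blocks as a clone pair when their multiset token overlap reaches a fraction of the larger block's size. The 0.8 threshold and the function names are illustrative assumptions, not values taken from the abstract.

```python
import math
from collections import Counter


def bag_of_tokens(code):
    # Assumption: whitespace splitting stands in for a real lexer.
    return Counter(code.split())


def overlap(bag_a, bag_b):
    """Number of tokens shared by two bags, counting repeated tokens."""
    return sum((bag_a & bag_b).values())


def is_clone_pair(code_a, code_b, threshold=0.8):
    """True if the shared tokens cover at least `threshold` of the larger block."""
    bag_a, bag_b = bag_of_tokens(code_a), bag_of_tokens(code_b)
    larger = max(sum(bag_a.values()), sum(bag_b.values()))
    if larger == 0:
        return False
    return overlap(bag_a, bag_b) >= math.ceil(threshold * larger)


reordered = is_clone_pair("int x = a + b ;", "int x = b + a ;")
unrelated = is_clone_pair("int x = a + b ;", "return foo ( ) ;")
```

Because a bag ignores token order, the reordered statement still counts as a clone of the original, which is one way such a comparison can tolerate near-miss edits.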

    Trendy bugs: Topic trends in the Android bug reports

    Studying vast volumes of bug and issue discussions can give an understanding of what the community has been most concerned about; however, the sheer number of documents can overload the analyst. We present an approach to analyzing the development of the Android open source project by observing trends in the bug discussions in its public issue tracker. This informs us of the features or parts of the project that are more problematic at any given point in time. In turn, this can be used to aid the allocation of resources (such as time and manpower) to parts or features. We support these ideas by presenting the results of issue topic distributions over time, using statistical analysis of the bug descriptions and comments for the Android open source project. Furthermore, we show relationships between those time distributions and major development releases of the Android OS.
    Keywords: bug logs; Android; topics; statistical trend analysis

    Semantic code search via equational reasoning

    © 2020 Owner/Author. We present a new approach to semantic code search based on equational reasoning, and the Yogo tool implementing this approach. Our approach works by considering not only the dataflow graph of a function, but also the dataflow graphs of all equivalent functions reachable via a set of rewrite rules. In doing so, it can recognize an operation even if it uses alternate APIs, is in a different but mathematically equivalent form, is split apart with temporary variables, or is interleaved with other code. Furthermore, it can recognize when code is an instance of some higher-level concept, such as iterating through a file. Because of this, from a single query, Yogo can find equivalent code in multiple languages. Our evaluation further shows the utility of Yogo beyond code search: encoding a buggy pattern as a Yogo query, we found a bug in Oracle's Graal compiler which had been missed by a hand-written static analyzer designed for that exact kind of bug. Yogo is built on the Cubix multi-language infrastructure, and currently supports Java and Python.